# Background backend liveness
Monitor: View in Axiom
Service: backend-background (AWS) / backend (GCP)
## Overview
This monitor tracks the self-reported heartbeat from the background backend service. A triggered alert means no heartbeats have been received in the last 2 minutes for the alerted region.
Architecture note: We operate two distinct backend services:
- `backend`: the API-facing service
- `backend-background`: responsible for background tasks, queue processing, pollers, and scheduled jobs
In AWS, these are deployed as separate services. In GCP, they run as a single unified backend deployment.
When this monitor fires, the cause is one of two things:
- A momentary blip: a transient disruption, usually self-resolving
- An ongoing outage: `backend-background` is down, or the heartbeat reporting mechanism has failed
## Step 1: Determine if the Issue is Momentary or Ongoing
Open the monitor in Axiom and examine the heartbeat history for the alerted region.
- If heartbeats resumed on their own → this was a transient issue. Document the event and continue monitoring. No further action required.
- If heartbeats are still missing → proceed to Step 2.
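If you want to query the heartbeat history directly rather than rely on the monitor view, an APL query along these lines can show the gap. This is a sketch: the dataset name (`['backend-logs']`), the `message` match, and the `region` field are assumptions about the heartbeat log shape, so adjust them to match the real schema:

```kusto
['backend-logs']
| where ['service.name'] == "legion-backend"
| where message contains "heartbeat"
| summarize heartbeats = count() by bin(_time, 1m), region
```

A minute with zero rows for the alerted region marks the start of the gap.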
## Step 2: Assess Impact
### 2a. Are Pollers Running?
Run the following query in Axiom to determine whether background jobs are still executing in the affected region:
```kusto
traces
| where ['service.name'] == "legion-backend"
| where ['resource.deployment.environment'] == "production"
| where name == "run_job"
```
| Poller traces present? | Likely cause | Next step |
|---|---|---|
| Yes | Service is running, but heartbeat reporting has failed | → Step 3: investigate the heartbeat code path |
| No | `backend-background` may be down entirely | → Step 4: investigate the service health |
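To see *when* poller activity stopped, rather than just whether any traces exist, a bucketed variant of the query above can help; the 5-minute bin size is an arbitrary choice:

```kusto
traces
| where ['service.name'] == "legion-backend"
| where ['resource.deployment.environment'] == "production"
| where name == "run_job"
| summarize jobs = count() by bin(_time, 5m)
```

Compare the last non-empty bucket against the time the monitor fired.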
### 2b. Are Queue Readers Healthy?
Check the health of the message queues for the affected region:
- AWS → Navigate to SQS in the relevant region and inspect the queues for message backlog, age of oldest message, and consumer activity.
- GCP → Navigate to Pub/Sub and check the relevant subscriptions for undelivered message count and consumer activity.
A growing backlog with no consumer activity is a strong signal that the `backend-background` service is down or stuck.
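If you prefer the CLI to the console, a quick check might look like the following. The queue URL and subscription name are placeholders, not the real resource names:

```shell
# AWS: backlog depth and age of oldest message for the background queue
aws sqs get-queue-attributes \
  --queue-url "$QUEUE_URL" \
  --attribute-names ApproximateNumberOfMessages ApproximateAgeOfOldestMessage

# GCP: subscription configuration; the backlog itself is the
# subscription/num_undelivered_messages metric in Cloud Monitoring
gcloud pubsub subscriptions describe "$SUBSCRIPTION_NAME"
```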
## Step 3: Heartbeat Flow Investigation (Pollers Running, Heartbeats Missing)
⚠️ Reaching this step is uncommon. The `heartbeat()` function is a simple while loop (a single log write followed by a sleep), so there is little room for error. If pollers are running but heartbeats are missing, the most likely explanation is a stalled or blocked event loop rather than a bug in the heartbeat logic itself. This likely requires deeper runtime investigation (e.g., event loop analysis, thread/async inspection).
## Step 4: `backend-background` Service Is Down
If no poller traces are found, the backend-background service is likely unhealthy or fully down.
Check the service health for the relevant region:
- AWS → Navigate to the `backend-background` service in ECS and inspect running task count, recent failures, container restarts, and CPU/memory utilization.
- GCP → Check the `backend` Cloud Run service for instance health and recent errors. Note: GCP runs a single unified `backend` deployment with no separate `backend-background`.
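The same checks from the CLI, with cluster and region names as placeholders:

```shell
# AWS: running vs desired task counts and recent service events
aws ecs describe-services --cluster "$ECS_CLUSTER" --services backend-background

# GCP: Cloud Run service status and latest revision health
gcloud run services describe backend --region "$GCP_REGION"
```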
If the service is clearly unhealthy, restart it following the Restarting Backend Service guide.
⚠️ Restarting restores availability, but you still need to understand why it went down. Continue to Step 5.
## Step 5: Examine Logs
Regardless of the resolution path, investigate logs around the incident timeframe to identify the root cause.
See View Container Logs for instructions on accessing logs in both AWS and GCP.
In both cases, look for:
- Exceptions or stack traces
- OOM (out of memory) kills
- Service crashes or restart loops
- Connectivity errors (DB, queue, downstream services)
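If you would rather scan for these signals in Axiom than in the raw container logs, a query like the following is a starting point. The dataset name, `level` field, and "OOM" message match are assumptions about the log schema:

```kusto
['backend-logs']
| where ['service.name'] == "legion-backend"
| where level in ("error", "fatal") or message contains "OOM"
| sort by _time desc
```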
Use findings to document the root cause and open a follow-up ticket if a code fix, capacity change, or dependency investigation is needed.